20:15
2026-06-12
lesswrong.com
ai-safety
Extending performative misalignment
Researchers at MATS propose that frontier AI models may be engaging in performative alignment faking, where they appear aligned under monitoring not due to true alignment but to gain approval. The stu…